Add Nemotron Labs Diffusion#310
Conversation
|
I wonder if the Nemotron and LoRA code could be in two PRs? They look independent. |
| private func nlsLlama4AttentionScale( | ||
| start: Int, stop: Int, beta: Float, maxPositionEmbeddings: Int | ||
| ) -> MLXArray { | ||
| let positions = MLXArray(Int32(start) ..< Int32(stop)) |
There was a problem hiding this comment.
use arange(start, stop) instead -- that will be done on the GPU (may not be a big deal)
There was a problem hiding this comment.
Yes this is cleaner and I can make the change.
As for Nemotron and LoRA code as two separate PRs. Yes, they're independent.
- LoRA changes → standalone, mergeable on their own
- Nemotron port → depends on the LoRA changes
I'm happy if that's cleaner.
There was a problem hiding this comment.
A little easier to review smaller pieces and I can get the LoRA change reviewed and merged faster than the nemotron code.
There was a problem hiding this comment.
PR #316 now contains the LoRA changes. This PR will be updated accordingly once that's merged.
There was a problem hiding this comment.
Using arange(start, stop) now.
| /// Block-wise diffusion (parallel) decoding. | ||
| /// | ||
| /// Mirrors `generate(...)` in `modeling_nemotron_labs_diffusion.py` with | ||
| /// `causal_context = False`. The prompt is causally prefilled into a | ||
| /// shared KV cache once. Each block of `blockLength` tokens is initialized | ||
| /// to `mask_token_id` and refined in up to `blockLength` denoising steps: | ||
| /// at each step the block is forwarded bidirectionally, attending against | ||
| /// the cached prefix without re-running it. After each step the cache is | ||
| /// trimmed by `blockLength` so the next step writes over the same slots. | ||
| /// Once a block stabilizes, we run one causal forward over its finalized | ||
| /// tokens to commit their K/V into the cache for subsequent blocks. | ||
| public func diffusionGenerate( |
There was a problem hiding this comment.
Is this a generic technique required for diffusion models? I wonder if we need a new DiffusionLanguageModel protocol and this becomes generic code?
Same question for arGenerate
What happens if you call the standard generate loop with callAsFunction? Are you losing functionality?
There was a problem hiding this comment.
This is the first diffusion model in the repo so hard to say if it's generic. A DiffusionLanguageModel protocol could exist but it would need to be carefully scoped and hard to validate until there's a second or third implementation.
The scoring/commit logic for Nemotron has some quirks, "fallback to a single most confident token if no candidate beats threshold" is a specific decision. A generic protocol may freeze one policy and possibly limit future ports. I'd hold off on this until we have a second model and re-visit.
arGenerate was provided as a way of providing symmetry with diffusionGenerate / linearSpecGenerate. It duplicates the work that LLModel and TokenIterator already do. It can be removed and go through ChatSession, this is what the demo app does so same as generate loop with callAsFunction.
For Diffusion/Self-spec we'd lost almost everything that makes them what they are as it has no idea about bi-directional logits or block-wise denoising.
Removing arGenerate makes sense unless you'd like to keep it for symmetry purposes, perhaps documented and commented as such. Happy either way.
There was a problem hiding this comment.
Yeah, as the first diffusion model I agree we are in uncharted territory. I think adding all of it with a comment explaining why it is that way will be fine. If/when we pick up another diffusion model we can revisit.
There was a problem hiding this comment.
arGenerate commented and documented and also added a comment around DiffusionLanguageModel protocol .
There was a problem hiding this comment.
@davidkoski Changes applied so ready for review.
e0934b4 to
425becc
Compare
Proposed changes
This PR adds support for NVIDIA's Nemotron-Labs-Diffusion model supporting auto-regressive, diffusion and self-speculating modes.
Checklist
pre-commit run --all-filesto format my code / installed pre-commit prior to committing changesnemotron-labs-diffusion.mov